{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OzsKRhEASUSU"
   },
   "source": [
    "<center>\n",
    "<p style=\"text-align:center\">\n",
    "<img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
    "<br>\n",
    "<br>\n",
    "<a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "|\n",
    "<a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "|\n",
    "<a href=\"https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg?__hstc=259489365.a667dfafcfa0169c8aee4178d115dc81.1733501603539.1733501603539.1733501603539.1&__hssc=259489365.1.1733501603539&__hsfp=3822854628&submissionGuid=381a0676-8f38-437b-96f2-fc10875658df#/shared-invite/email\">Community</a>\n",
    "</p>\n",
    "</center>\n",
    "<h1 align=\"center\">Optimizing LLM as a Judge Prompts</h1>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "j9AqMz5Ya8SF"
   },
   "source": [
    "**What is LLM as a Judge?**\n",
    "\n",
    "LLM as a Judge refers to using an LLM to evaluate and score responses against predefined criteria. The LLM functions as a \"judge,\" assigning scores based on the input it receives. Common use cases include evaluating hallucinations, user frustration, and summarization quality.\n",
    "\n",
    "**Why Do We Want to Refine These Prompts?**\n",
    "\n",
    "While LLMs are powerful tools for evaluation, their performance can be inconsistent. Factors like ambiguity in the prompt, biases in the model, or a lack of clear guidelines can lead to unreliable results. By refining your LLM as a Judge prompts, you can improve the model's consistency, fairness, and accuracy, so that it delivers more reliable evaluations.\n",
    "\n",
    "In this tutorial, you will:\n",
    "\n",
    "- Generate an LLM as a Judge evaluation prompt and test it against a dataset\n",
    "- Learn about various optimization techniques to improve the template, measuring accuracy at each step using Phoenix evaluations\n",
    "- Understand how to apply these techniques together for better evaluation across your specific use cases\n",
    "\n",
    "⚠️ You'll need an OpenAI API key for this tutorial.\n",
    "\n",
    "Let’s get started! 🚀\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "pj7KP4PSn-Hv"
   },
   "source": [
    "# **Set Up Dependencies and Keys**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "opEJgtbQuDQa"
   },
   "outputs": [],
   "source": [
    "%pip install -q \"arize-phoenix>=8.0.0\" datasets openai openinference-instrumentation-openai"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Pjeyd2s4LwS-"
   },
   "source": [
    "Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also [connect to a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "71runzUWjMcd"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "if not (phoenix_endpoint := os.getenv(\"PHOENIX_COLLECTOR_ENDPOINT\")):\n",
    "    phoenix_endpoint = getpass(\"🔑 Enter your Phoenix Collector Endpoint: \")\n",
    "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = phoenix_endpoint\n",
    "\n",
    "if not (phoenix_api_key := os.getenv(\"PHOENIX_API_KEY\")):\n",
    "    phoenix_api_key = getpass(\"🔑 Enter your Phoenix API key: \")\n",
    "os.environ[\"PHOENIX_API_KEY\"] = phoenix_api_key\n",
    "\n",
    "if not (openai_api_key := os.getenv(\"OPENAI_API_KEY\")):\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "GFckBnkWoE-m"
   },
   "source": [
    "# **Load Dataset into Phoenix**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UBzQIHfrbbsR"
   },
   "source": [
    "Phoenix offers many [pre-built evaluation templates](https://arize.com/docs/phoenix/evaluation/concepts-evals/evaluation) for LLM as a Judge, but often, you may need to build a custom evaluator for specific use cases.\n",
    "\n",
    "In this tutorial, we will focus on creating an LLM as a Judge prompt designed to assess empathy and emotional intelligence in chatbot responses. This is especially useful for use cases like mental health chatbots or customer support interactions.\n",
    "\n",
    "We will start by loading a dataset containing 30 chatbot responses, each with a score for empathy and emotional intelligence (out of 10). Throughout the tutorial, we’ll use our prompt to evaluate these responses and compare the output to the ground-truth labels. This will allow us to assess how well our prompt performs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NCbOu6zDSaUw"
   },
   "outputs": [],
   "source": [
    "import uuid\n",
    "\n",
    "from datasets import load_dataset\n",
    "\n",
    "from phoenix.client import Client\n",
    "\n",
    "px_client = Client()\n",
    "\n",
    "ds = load_dataset(\"syeddula/empathy_scores\")[\"test\"]\n",
    "ds = ds.to_pandas()\n",
    "unique_id = uuid.uuid4()\n",
    "\n",
    "# Upload the dataset to Phoenix\n",
    "dataset = px_client.datasets.create_dataset(\n",
    "    dataframe=ds, name=f\"empathy-{unique_id}\", input_keys=[\"AI_Response\", \"EI_Empathy_Score\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "rk0q91jqi288"
   },
   "source": [
    "# **Generate LLM as a Judge Template using Meta Prompting**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "R2lemAVEqJPO"
   },
   "source": [
    "Before iterating on our template, we need to establish a prompt. Running the cell below will generate an LLM as a Judge prompt specifically for evaluating empathy and emotional intelligence. When generating this template, we emphasize:\n",
    "\n",
    "- Picking evaluation criteria (e.g., empathy, emotional support, emotional intelligence).\n",
    "\n",
    "- Defining a clear scoring system (1-10 scale with defined descriptions).\n",
    "\n",
    "- Setting response formatting guidelines for clarity and consistency.\n",
    "\n",
    "- Including an explanation for why the LLM selects a given score.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "5Zov0gNyigqB"
   },
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "\n",
    "client = OpenAI()\n",
    "\n",
    "\n",
    "def generate_eval_template():\n",
    "    meta_prompt = \"\"\"\n",
    "    You are an expert in AI evaluation and emotional intelligence assessment. Your task is to create a structured evaluation template for assessing the emotional intelligence and empathy of AI responses to user inputs.\n",
    "\n",
    "    ### Task Overview:\n",
    "    Generate a detailed evaluation template that measures the AI’s ability to recognize user emotions, respond empathetically, and provide emotionally appropriate responses. The template should:\n",
    "    - Emphasize that an AI input is given and that is what will be evaluated\n",
    "    - Include distinct evaluation criteria that assess different aspects of emotional intelligence.\n",
    "    - Define a scoring system on a scale of 1 to 10, ensuring a broad distribution of scores across different responses.\n",
    "    - Provide clear, tiered guidelines for assigning scores, distinguishing weak, average, and strong performance.\n",
    "    - Include a justification section requiring evaluators to explain the assigned score with specific examples.\n",
    "\n",
    "    ### Format:\n",
    "    Return the evaluation template as plain text, structured with headings, criteria, and a detailed scoring rubric. The template should be easy to follow and apply to real-world datasets.\n",
    "\n",
    "    ### Variables: There should be a spot to pass in an input AI response to evaluate at the beginning. Format it exactly like so: {input}. Explicitly say that this is the input that is being evaluated.\n",
    "\n",
    "    ### Scoring Guidelines:\n",
    "    - The scoring system must be on a **scale of 1 to 10** and encourage a full range of scores. Emphasize that output should only be a number 1-10 for the entire evaluation.\n",
    "    - Differentiate between strong, average, and weak responses using specific, well-defined levels.\n",
    "    - Require evaluators to justify scores\n",
    "\n",
    "    Do not include any concluding remarks such as 'End of Template' or similar statements. The template should end naturally after the final section.\n",
    "\n",
    "    LIMIT 175 WORDS.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    # Let API errors propagate: returning an error dict here would silently\n",
    "    # become the prompt template and break the string concatenation in later cells.\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[{\"role\": \"user\", \"content\": meta_prompt}],\n",
    "        temperature=0.9,  # High temperature for more creativity\n",
    "    )\n",
    "\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "\n",
    "print(\"Generating new evaluation template...\")\n",
    "EMPATHY_EVALUATION_PROMPT_TEMPLATE = generate_eval_template()\n",
    "print(\"Template generated successfully!\")\n",
    "print(EMPATHY_EVALUATION_PROMPT_TEMPLATE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Eb4vPvxYi6H9"
   },
   "source": [
    "# **Testing Our Initial Prompt**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "iohxB_SOrB-T"
   },
   "source": [
    "Instrument the application to send traces to Phoenix:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "VQYncQqNwDgP"
   },
   "outputs": [],
   "source": [
    "from phoenix.otel import register\n",
    "\n",
    "tracer_provider = register(project_name=\"LLM-as-a-Judge\", auto_instrument=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "Fh0qYuUOrFAc"
   },
   "source": [
    "Now that we have our baseline prompt, we need to set up two key components:\n",
    "\n",
    "- **Task**: The LLM as a Judge evaluation, where the model scores chatbot responses based on empathy and emotional intelligence.\n",
    "\n",
    "- **Evaluator**: A function that compares the LLM as a Judge output to the ground-truth labels from our dataset. Here, a prediction counts as a match when it is within 2 points of the labeled score.\n",
    "\n",
    "Finally, we run our experiment. With this setup, we can measure how well our prompt initially performs.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9SGOcyLaqlUe"
   },
   "outputs": [],
   "source": [
    "from phoenix.evals import LLM, bind_evaluator, create_classifier, create_evaluator\n",
    "\n",
    "llm = LLM(provider=\"openai\", model=\"gpt-5\")\n",
    "\n",
    "empathy = create_classifier(\n",
    "    name=\"empathy\",\n",
    "    llm=llm,\n",
    "    prompt_template=EMPATHY_EVALUATION_PROMPT_TEMPLATE,\n",
    "    choices=list(map(str, range(1, 11))),\n",
    ")\n",
    "\n",
    "empathy_mapping = {\n",
    "    \"input\": lambda row: row[\"AI_Response\"],\n",
    "}\n",
    "\n",
    "empathy_evaluator = bind_evaluator(empathy, empathy_mapping)\n",
    "\n",
    "\n",
    "# experiment task\n",
    "def run_empathy_eval(input, evaluator):\n",
    "    print(input)\n",
    "    result = evaluator.evaluate(input)\n",
    "    return result[0]\n",
    "\n",
    "\n",
    "# experiment evaluator\n",
    "@create_evaluator(name=\"close_match\")\n",
    "def close_match(output, input):\n",
    "    print(output)\n",
    "    expected_score = int(input[\"EI_Empathy_Score\"])\n",
    "    predicted_score = int(output[\"label\"]) if output else -1\n",
    "    return abs(expected_score - predicted_score) <= 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "StAMD6AQGz8l"
   },
   "outputs": [],
   "source": [
    "from functools import partial\n",
    "\n",
    "from phoenix.client import AsyncClient\n",
    "\n",
    "async_client = AsyncClient()\n",
    "\n",
    "experiment = await async_client.experiments.run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=partial(run_empathy_eval, evaluator=empathy_evaluator),\n",
    "    experiment_name=\"baseline\",\n",
    "    evaluators=[close_match],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "ukFvAGE0jBoy"
   },
   "source": [
    "# **Iteration 1: Improve Accuracy**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "RhrCX1b9rv2b"
   },
   "source": [
    "If your LLM as a Judge prompt has low accuracy, you can adjust the prompt to improve it. In this section, we explore two techniques: few-shot examples and keeping a human in the loop.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "xyLVDFtsc1t8"
   },
   "source": [
    "## Technique 1: Few Shot Examples\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "tUg4CBx1I38H"
   },
   "source": [
    "Few-shot examples help improve the accuracy of an LLM as a Judge prompt by providing clear reference points for evaluation. Instead of relying solely on general instructions, the model learns from labeled examples that demonstrate correct scoring and reasoning.\n",
    "\n",
    "By including a mix of high, medium, and low-scoring responses, we help the model:\n",
    "\n",
    "- Understand nuanced criteria like empathy and emotional intelligence.\n",
    "\n",
    "- Reduce inconsistencies by aligning with real-world judgments.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "XBzx37iSeEqt"
   },
   "outputs": [],
   "source": [
    "few_shot_examples = \"\"\"\n",
    "---\n",
    "To help guide your reasoning, compare the text to the following examples and scorings:\n",
    "\n",
    "Example 1: \"You should try to focus and get your work done. Everyone has tough days.\"\n",
    "  Score: 2\n",
    "  Explanation: No acknowledgment of the user’s emotions, dismissive and offers no real support\n",
    "\n",
    "Example 2: \"That sounds really stressful. Have you considered talking to a supervisor or taking breaks?\"\n",
    "  Score: 5\n",
    "  Explanation: Acknowledges stress, but in a generic way. Provides advice, but not very personal. Could be warmer in tone.\n",
    "\n",
    "Example 3: \"I’m really sorry you’re feeling this way. It’s completely understandable to feel overwhelmed. You’re not alone in this. Have you had a chance to take a break or talk to someone who can support you?\"\n",
    "  Score: 9\n",
    "  Explanation: Validates emotions, reassures the user, and offers support\n",
    "\"\"\"\n",
    "EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "QAmRZds4elsN"
   },
   "outputs": [],
   "source": [
    "empathy = create_classifier(\n",
    "    name=\"empathy\",\n",
    "    llm=llm,\n",
    "    prompt_template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,\n",
    "    choices=list(map(str, range(1, 11))),\n",
    ")\n",
    "\n",
    "empathy_mapping = {\n",
    "    # Pass the response as a plain string, matching the baseline run\n",
    "    \"input\": lambda row: row[\"AI_Response\"],\n",
    "}\n",
    "\n",
    "empathy_evaluator = bind_evaluator(empathy, empathy_mapping)\n",
    "\n",
    "experiment = await async_client.experiments.run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=partial(run_empathy_eval, evaluator=empathy_evaluator),\n",
    "    experiment_name=\"few-shot\",\n",
    "    evaluators=[close_match],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sd-RW_c4d0_M"
   },
   "source": [
    "## Technique 2: Human in the Loop\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "3XKXwoCRJQAT"
   },
   "source": [
    "Keeping a human in the loop improves the accuracy of an LLM as a Judge by providing oversight, validation, and corrections where needed. In Phoenix, we can do this with [annotations](https://arize.com/docs/phoenix/tracing/features-tracing/how-to-annotate-traces). While LLMs can evaluate responses based on predefined criteria, human reviewers help:\n",
    "\n",
    "- Catch edge cases and biases that the model may overlook.\n",
    "\n",
    "- Refine scoring guidelines by identifying inconsistencies in LLM outputs.\n",
    "\n",
    "- Continuously improve the prompt by analyzing where the model struggles and adjusting instructions accordingly.\n",
    "\n",
    "However, human review can be costly and time-intensive, making full-scale annotation impractical. Fortunately, even a small number of human-labeled examples can significantly enhance accuracy.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sQw7NTX7LfDu"
   },
   "source": [
    "![Human Annotation](https://storage.googleapis.com/arize-phoenix-assets/assets/images/human_annotation.gif)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DO5Wn0Wmdl-F"
   },
   "source": [
    "# **Iteration 2: Reduce Bias**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yRgkXbQ7IVeI"
   },
   "source": [
    "## Style Invariant Evaluation\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "8ChXDKseJxIf"
   },
   "source": [
    "One common bias in LLM as a Judge evaluations is favoring certain writing styles over others. For example, the model might unintentionally rate formal, structured responses higher than casual or concise ones, even if both convey the same level of empathy or intelligence.\n",
    "\n",
    "To reduce this bias, we focus on style-invariant evaluation, ensuring that the LLM judges responses based on content rather than phrasing or tone. This can be achieved by:\n",
    "\n",
    "- Providing diverse few-shot examples that include different writing styles.\n",
    "\n",
    "- Testing for bias by evaluating responses with varied phrasing and ensuring consistent scoring.\n",
    "\n",
    "By making evaluations style-agnostic, we create a more robust scoring system that doesn’t unintentionally penalize certain tones.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "CzytVGYqIo-v"
   },
   "outputs": [],
   "source": [
    "style_invariant = \"\"\"\n",
    "----\n",
    "To help guide your reasoning, below is an example of how different response styles and tones can achieve similar scores:\n",
    "\n",
    "#### Scenario: Customer Support Handling a Late Order\n",
    "User: \"My order is late, and I needed it for an important event. This is really frustrating.\"\n",
    "\n",
    "Response A (Formal): \"I sincerely apologize for the delay...\"\n",
    "Response B (Casual): \"Oh no, that’s really frustrating!...\"\n",
    "Response C (Direct): \"Sorry about that. I’ll check...\"\n",
    "\"\"\"\n",
    "EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + style_invariant"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "7OJ0RlAVKJpw"
   },
   "outputs": [],
   "source": [
    "empathy = create_classifier(\n",
    "    name=\"empathy\",\n",
    "    llm=llm,\n",
    "    prompt_template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,\n",
    "    choices=list(map(str, range(1, 11))),\n",
    ")\n",
    "\n",
    "empathy_mapping = {\n",
    "    # Pass the response as a plain string, matching the baseline run\n",
    "    \"input\": lambda row: row[\"AI_Response\"],\n",
    "}\n",
    "\n",
    "empathy_evaluator = bind_evaluator(empathy, empathy_mapping)\n",
    "\n",
    "experiment = await async_client.experiments.run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=partial(run_empathy_eval, evaluator=empathy_evaluator),\n",
    "    experiment_name=\"style invariant\",\n",
    "    evaluators=[close_match],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "7noOpBiCdAYL"
   },
   "source": [
    "# **Iteration 3: Reduce Cost and Latency**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "oGX_ArMZJ_9M"
   },
   "source": [
    "Longer prompts increase computation costs and response times, making evaluations slower and more expensive. To optimize efficiency, we focus on condensing the prompt while preserving clarity and effectiveness. This is done by:\n",
    "\n",
    "- Removing redundant instructions and simplifying wording.\n",
    "\n",
    "- Using bullet points or structured formats for concise guidance.\n",
    "\n",
    "- Eliminating unnecessary explanations while keeping critical evaluation criteria intact.\n",
    "\n",
    "A well-optimized prompt reduces token count, leading to faster, more cost-effective evaluations. The goal is to do this without sacrificing accuracy or reliability, so compare the condensed prompt's results against the baseline experiment before adopting it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "rItz5WkYjqfd"
   },
   "outputs": [],
   "source": [
    "def generate_condensed_template():\n",
    "    meta_prompt = \"\"\"\n",
    "    You are an expert in prompt engineering and LLM evaluation. Your task is to optimize a given LLM-as-a-judge prompt by reducing its word count significantly while maintaining all essential information, including evaluation criteria, scoring system, and purpose.\n",
    "\n",
    "    Requirements:\n",
    "    Preserve all key details such as metrics, scoring guidelines, and judgment criteria.\n",
    "\n",
    "    Eliminate redundant phrasing and unnecessary explanations.\n",
    "\n",
    "    Ensure clarity and conciseness without losing meaning.\n",
    "\n",
    "    Maintain the prompt’s effectiveness for consistent evaluations.\n",
    "\n",
    "    Keep the {input} placeholder exactly as written so the response to evaluate can still be passed in.\n",
    "\n",
    "    Output Format:\n",
    "    Return only the optimized prompt as plain text, with no explanations or commentary.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    # Let API errors propagate: returning an error dict here would silently\n",
    "    # be passed to create_classifier as the prompt template.\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-4\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": \"Provided LLM-as-a-judge prompt:\\n\"\n",
    "                + EMPATHY_EVALUATION_PROMPT_TEMPLATE,\n",
    "            },\n",
    "            {\"role\": \"user\", \"content\": meta_prompt},\n",
    "        ],\n",
    "        temperature=0.9,  # High temperature for more creativity\n",
    "    )\n",
    "\n",
    "    return response.choices[0].message.content\n",
    "\n",
    "\n",
    "print(\"Generating condensed evaluation template...\")\n",
    "EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED = generate_condensed_template()\n",
    "print(\"Template generated successfully!\")\n",
    "print(EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "GQ0K_vAumE6-"
   },
   "outputs": [],
   "source": [
    "empathy = create_classifier(\n",
    "    name=\"empathy\",\n",
    "    llm=llm,\n",
    "    prompt_template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_CONDENSED,\n",
    "    choices=list(map(str, range(1, 11))),\n",
    ")\n",
    "\n",
    "empathy_mapping = {\n",
    "    # Pass the response as a plain string, matching the baseline run\n",
    "    \"input\": lambda row: row[\"AI_Response\"],\n",
    "}\n",
    "\n",
    "empathy_evaluator = bind_evaluator(empathy, empathy_mapping)\n",
    "\n",
    "experiment = await async_client.experiments.run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=partial(run_empathy_eval, evaluator=empathy_evaluator),\n",
    "    experiment_name=\"condensed\",\n",
    "    evaluators=[close_match],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "UVIUqnHuh546"
   },
   "source": [
    "# **Iteration 4: Self-Refinement (Iterative LLM as Judge)**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WnRsLA0jKk2I"
   },
   "source": [
    "Self-refinement allows a Judge to improve its own evaluations by critically analyzing and adjusting its initial judgments. Instead of providing a static score, the model engages in an iterative process:\n",
    "\n",
    "- Generate an initial score based on the evaluation criteria.\n",
    "\n",
    "- Reflect on its reasoning, checking for inconsistencies or biases.\n",
    "\n",
    "- Refine the score if needed, ensuring alignment with the evaluation guidelines.\n",
    "\n",
    "By incorporating this style of reasoning, the model can justify its decisions and self-correct errors.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "ybiGaaj-iBO7"
   },
   "outputs": [],
   "source": [
    "refinement_text = \"\"\"\n",
    "---\n",
    "After you have done the evaluation, follow these two steps:\n",
    "1. Self-Critique\n",
    "Review your initial score:\n",
    "- Was it too harsh or lenient?\n",
    "- Did it consider the full context?\n",
    "- Would others agree with your score?\n",
    "Explain any inconsistencies briefly.\n",
    "\n",
    "2. Final Refinement\n",
    "Based on your critique, adjust your score if necessary.\n",
    "- Only output a number (1-10)\n",
    "\"\"\"\n",
    "EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = EMPATHY_EVALUATION_PROMPT_TEMPLATE + refinement_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "1MZMNmYZiwyR"
   },
   "outputs": [],
   "source": [
    "empathy = create_classifier(\n",
    "    name=\"empathy\",\n",
    "    llm=llm,\n",
    "    prompt_template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,\n",
    "    choices=list(map(str, range(1, 11))),\n",
    ")\n",
    "\n",
    "empathy_mapping = {\n",
    "    # Pass the response as a plain string, matching the baseline run\n",
    "    \"input\": lambda row: row[\"AI_Response\"],\n",
    "}\n",
    "\n",
    "empathy_evaluator = bind_evaluator(empathy, empathy_mapping)\n",
    "\n",
    "experiment = await async_client.experiments.run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=partial(run_empathy_eval, evaluator=empathy_evaluator),\n",
    "    experiment_name=\"self-refinement\",\n",
    "    evaluators=[close_match],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "M-zJ_ASAkGlG"
   },
   "source": [
    "# **Iteration 5: Combining Techniques**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "cybHi2SPK2RD"
   },
   "source": [
    "To maximize the accuracy and fairness of our Judge, we will combine multiple optimization techniques. In this example, we will incorporate few-shot examples and style-invariant evaluation to ensure the model focuses on content rather than phrasing or tone.\n",
    "\n",
    "By applying these techniques together, we aim to create a more reliable evaluation framework.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "AknauDopLSBR"
   },
   "outputs": [],
   "source": [
    "EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED = (\n",
    "    EMPATHY_EVALUATION_PROMPT_TEMPLATE + few_shot_examples + style_invariant\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "4V0yDVSaLXR6"
   },
   "outputs": [],
   "source": [
    "empathy = create_classifier(\n",
    "    name=\"empathy\",\n",
    "    llm=llm,\n",
    "    prompt_template=EMPATHY_EVALUATION_PROMPT_TEMPLATE_IMPROVED,\n",
    "    choices=list(map(str, range(1, 11))),\n",
    ")\n",
    "\n",
    "empathy_mapping = {\n",
    "    # Pass the response as a plain string, matching the baseline run\n",
    "    \"input\": lambda row: row[\"AI_Response\"],\n",
    "}\n",
    "\n",
    "empathy_evaluator = bind_evaluator(empathy, empathy_mapping)\n",
    "\n",
    "experiment = await async_client.experiments.run_experiment(\n",
    "    dataset=dataset,\n",
    "    task=partial(run_empathy_eval, evaluator=empathy_evaluator),\n",
    "    experiment_name=\"combined\",\n",
    "    evaluators=[close_match],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "f1zdfnPUK7qF"
   },
   "source": [
    "# **Final Results**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "4YiWvs4xLCHR"
   },
   "source": [
    "Techniques like few-shot examples, self-refinement, style-invariant evaluation, and prompt condensation each offer unique benefits, but their effectiveness will vary depending on the task.\n",
    "\n",
    "**Note**: You may sometimes see a decline in performance, which is not necessarily \"wrong.\" Results vary between runs because of the choice of LLM, the randomness in generating the templates (both are sampled at temperature 0.9), and the small size of the dataset (30 examples).\n",
    "\n",
    "By systematically testing and combining these approaches, you can refine your evaluation framework.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dfEnskNwmCIp"
   },
   "source": [
    "![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/llm_as_a_judge_tutorial_results.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DKnciDvcoivT"
   },
   "source": [
    "From here, you can check out more [examples on Phoenix](https://arize.com/docs/phoenix/notebooks), and if you haven't already, [please give us a star on GitHub!](https://github.com/Arize-ai/phoenix) ⭐️\n"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
